[SPARK-14388][SQL] Implement CREATE TABLE #12271
Conversation
We need to reconcile the differences between what's added here in SparkSqlParser and HiveSqlParser. That will come in the next commit. This currently still fails tests, obviously because create table is not implemented yet!
Before: `CatalogTable` has `schema`, `partitionColumns`, and `sortColumns`, with no constraints between the three; however, Hive will complain if `schema` and `partitionColumns` overlap. After: `CatalogTable` has `schema`, `partitionColumnNames`, `sortColumnNames`, `bucketColumnNames`, and `skewColumnNames`, where all of these columns must be a subset of `schema`. This means splitting `schema` into `(schema, partitionCols)` before passing it to Hive, but it lets us store the columns more uniformly; otherwise the partition columns would be the odd one out. This commit also fixes "alter table bucketing", which was incorrectly using partition columns as bucket columns.
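A simplified sketch of the field layout described above; the real `CatalogTable` carries more fields (identifier, storage format, properties, ...), so this is illustrative only:

```scala
case class CatalogColumn(name: String, dataType: String, comment: Option[String] = None)

// Sketch: every column-name list must reference columns that exist in `schema`.
case class CatalogTable(
    schema: Seq[CatalogColumn],
    partitionColumnNames: Seq[String],
    sortColumnNames: Seq[String],
    bucketColumnNames: Seq[String],
    skewColumnNames: Seq[String]) {
  private val names = schema.map(_.name).toSet
  require((partitionColumnNames ++ sortColumnNames ++ bucketColumnNames ++
    skewColumnNames).forall(names.contains),
    "partition/sort/bucket/skew columns must be a subset of the schema")
}
```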
This involves reverting part of the changes in an earlier commit, where we tried to implement the parsing logic in the general SQL parser and introduced a bunch of case classes that we won't end up using. As of this commit the actual CREATE TABLE logic is not there yet. It will come in a future commit.
```scala
 * ALTER TABLE table1 RENAME TO table2;
 * ALTER VIEW view1 RENAME TO view2;
 * }}}
 */
```
Moved to tables.scala. I just moved this one command for now to avoid inflating the diff too much.
```
fileFormat
    : INPUTFORMAT inFmt=STRING OUTPUTFORMAT outFmt=STRING (SERDE serdeCls=STRING)?
      (INPUTDRIVER inDriver=STRING OUTPUTDRIVER outDriver=STRING)?    #tableFileFormat
```
@hvanhovell I deleted INPUTDRIVER and OUTPUTDRIVER here because Hive doesn't support them. Why were they added in the first place? Is there any supporting documentation for this somewhere?
@andrewor14 I wanted to make sure we supported the same grammar as Hive and I used their grammars as a basis. So this is defined in the following two locations:
- https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g#L1335-L1336
- https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g#L1925-L1926
The main idea was that I could throw better errors. But if it is not supported by Hive itself then please remove it!
Let's remove it. I have never seen this before and it is not documented anywhere.
Test build #55425 has finished for PR 12271 at commit
Test build #55426 has finished for PR 12271 at commit
@andrewor14 Just want to let you know, @xwu0226 is working on the command
```
 * [COMMENT table_comment]
 * [PARTITIONED BY (col3 data_type [COMMENT col_comment], ...)]
 * [CLUSTERED BY (col1, ...) [SORTED BY (col1 [ASC|DESC], ...)] INTO num_buckets BUCKETS]
 * [SKEWED BY (col1, col2, ...) ON ((col_value, col_value, ...), ...)
```
@andrewor14 I just did a quick check with our InsertIntoHiveTable command. It seems this command does not really understand how to handle a table that has CLUSTERED BY, SORTED BY, or SKEWED BY specifications. How about we just throw an exception when a table definition provides these specs?
ok
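A minimal sketch of the guard agreed on here. The context field names (`bucketSpec`, `skewSpec`) and the `ParseException(message, ctx)` usage are assumptions about the surrounding parser code, not the exact change in this PR:

```scala
import org.apache.spark.sql.catalyst.parser.ParseException

// Hypothetical: inside the AstBuilder's CREATE TABLE visitor, reject specs
// that InsertIntoHiveTable cannot honor. `ctx` is assumed to be the ANTLR
// rule context for the CREATE TABLE statement.
if (ctx.bucketSpec != null) {
  throw new ParseException("CREATE TABLE ... CLUSTERED BY is not supported yet", ctx)
}
if (ctx.skewSpec != null) {
  throw new ParseException("CREATE TABLE ... SKEWED BY is not supported yet", ctx)
}
```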
Test build #55552 has finished for PR 12271 at commit
I'll have a look at the failing tests in a couple of hours.
```
 * [COMMENT table_comment]
 * [PARTITIONED BY (col3 data_type [COMMENT col_comment], ...)]
 * [CLUSTERED BY (col1, ...) [SORTED BY (col1 [ASC|DESC], ...)] INTO num_buckets BUCKETS]
 * [SKEWED BY (col1, col2, ...) ON ((col_value, col_value, ...), ...) [STORED AS DIRECTORIES]]
```
Remove this line while resolving the conflicts?
Previously we always converted the data type string to lower case. However, for struct fields this also converts the struct field names to lower case, which is not what tests (or perhaps user code) expect.
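To illustrate the problem (the field name `userId` is made up, and the commented results reflect expected parser behavior, not output captured from this PR):

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// Lower-casing the whole type string also mangles struct field names:
"STRUCT<userId: INT>".toLowerCase
// => "struct<userid: int>"   -- the case of `userId` is lost

// Parsing first and then printing normalizes keyword/type casing
// while preserving the field name:
CatalystSqlParser.parseDataType("STRUCT<userId: INT>").simpleString
// => "struct<userId:int>"
```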
Test build #55658 has finished for PR 12271 at commit
Test build #55666 has finished for PR 12271 at commit
I've ignored some tests in
Test build #55674 has finished for PR 12271 at commit
| "date_join1", | ||
| "date_serde", | ||
| "decimal_1", | ||
| //"decimal_1", // TODO: cannot parse column decimal(5) |
Yeah, we should support decimal(5). Here, the scale is 0.
```scala
// just convert the whole type string to lower case, otherwise the struct field names
// will no longer be case sensitive. Instead, we rely on our parser to get the proper
// case before passing it to Hive.
HiveMetastoreTypes.toDataType(col.dataType.getText).simpleString,
```
I think we need to use `parseDataType` provided in `AbstractSqlParser`.
There were a few differences in DESCRIBE TABLE:
- the output format should be HiveIgnoreKeyTextOutputFormat
- num buckets should be -1
- last access time should be -1
- EXTERNAL should not be set to false for managed tables

After making these changes, our result now matches Hive's.
CatalystSqlParser knows how to parse decimal(5)!
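For illustration, with only a precision given the scale defaults to 0 (the commented results reflect expected behavior, not output captured from this PR):

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

CatalystSqlParser.parseDataType("decimal(5)")    // DecimalType(5, 0)
CatalystSqlParser.parseDataType("decimal(5,2)")  // DecimalType(5, 2)
```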
```scala
// just convert the whole type string to lower case, otherwise the struct field names
// will no longer be case sensitive. Instead, we rely on our parser to get the proper
// case before passing it to Hive.
CatalystSqlParser.parseDataType(col.dataType.getText).simpleString,
```
MINOR/NIT: The DataType parsing is done in the AstBuilder, so we really don't need to parse this string again. You could use the (magical/evil) `typedVisit[DataType](col.dataType)` here.
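In other words (a sketch that assumes this code lives inside the AstBuilder, where `typedVisit` and `org.apache.spark.sql.types.DataType` are in scope):

```scala
// Instead of printing the ANTLR subtree and re-parsing the resulting string:
//   CatalystSqlParser.parseDataType(col.dataType.getText).simpleString
// visit the already-parsed subtree directly:
val dataType = typedVisit[DataType](col.dataType)
dataType.simpleString
```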
Test build #55692 has finished for PR 12271 at commit
Test build #55693 has finished for PR 12271 at commit
Test build #55696 has finished for PR 12271 at commit
```scala
table.storage.outputFormat.map(toOutputFormat).foreach(hiveTable.setOutputFormatClass)
table.storage.serde.foreach(hiveTable.setSerializationLib)
hiveTable.setSerializationLib(
  table.storage.serde.getOrElse("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"))
```
Hmm... without this PR, the storage-related fields of the table defined in org.apache.spark.sql.hive.MetastoreDataSourcesSuite's "persistent JSON table" test are:

```
SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
InputFormat:   org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:  org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
```

Looks like the change here somehow breaks the test (without the change, we somehow trigger a weird code path in Hive and get at least one column, called col).
For now, if the schema does not have any fields, can we set it to the following one?

```
# col_name  data_type      comment

col         array<string>  from deserializer
```

That way we preserve the existing (and weird) behavior.
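A hedged sketch of that fallback, assuming Hive's `FieldSchema` metastore API; the actual fix in this PR may differ:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.api.FieldSchema

// If no columns were specified, surface Hive's legacy placeholder column
// instead of an empty schema.
if (table.schema.isEmpty) {
  hiveTable.setFields(
    Seq(new FieldSchema("col", "array<string>", "from deserializer")).asJava)
}
```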
Why are there all these undocumented implicit behaviors in Hive :(
Now, after a few bad experiences, I think we need to add test cases to ensure the Hive APIs work as we expect. See another issue I hit in alter table ... drop partition: #12220 (diff)
I feel it is better to have more checks in SessionCatalog to make sure that a request is valid (that would help your case).
Yeah, agreed, if you are talking about my case. :)
OK, I cherry-picked your changes and updated the comment.
Test build #55727 has finished for PR 12271 at commit
OK. Merging to master.
What changes were proposed in this pull request?
This patch implements the `CREATE TABLE` command using the `SessionCatalog`. Previously we handled only `CTAS` and `CREATE TABLE ... USING`. This requires us to refactor `CatalogTable` to accept various fields (e.g. bucket and skew columns) and pass them to Hive.

WIP: Note that I haven't verified whether this actually works yet! But I believe it does.
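For context, this is the flavor of Hive-style DDL the new code path is meant to handle (an illustrative statement; the table and column names are made up, not a test from this patch):

```scala
sqlContext.sql("""
  CREATE TABLE sales (id INT, amount DECIMAL(5))
  COMMENT 'demo table'
  PARTITIONED BY (region STRING)
  STORED AS TEXTFILE
""")
```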
How was this patch tested?
Tests will come in a future commit.